TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation
This paper addresses the challenge of transferring the behavior expressivity style of a virtual agent to another one while preserving the shape of behaviors, as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement scheme to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and allows generalization to styles that have not been seen during the training phase. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state-of-the-art models in style transfer for both seen and unseen styles during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behaviors and gestures associated with the target style are successfully transferred, while ensuring the preservation of those related to the source content.
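A minimal sketch of the disentanglement idea described above, assuming PyTorch; the module names (ContentEncoder, StyleEncoder, Decoder) and all dimensions are illustrative stand-ins, not the paper's actual architecture. Separate encoders produce a content sequence and a fixed-size style vector, and a decoder synthesizes behaviors from source content combined with target style:

```python
# Sketch of style/content disentanglement for behavior style transfer.
# Assumes PyTorch; all module names and dimensions are illustrative.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Encodes multimodal source features into a content sequence."""
    def __init__(self, feat_dim=128, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        return self.encoder(self.proj(x))      # (batch, time, d_model)

class StyleEncoder(nn.Module):
    """Pools target-speaker features into one fixed-size style vector."""
    def __init__(self, feat_dim=128, style_dim=64):
        super().__init__()
        self.net = nn.GRU(feat_dim, style_dim, batch_first=True)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        _, h = self.net(x)
        return h[-1]                           # (batch, style_dim)

class Decoder(nn.Module):
    """Generates behaviors from source content conditioned on target style."""
    def __init__(self, d_model=256, style_dim=64, out_dim=96):
        super().__init__()
        self.out = nn.Linear(d_model + style_dim, out_dim)

    def forward(self, content, style):
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.out(torch.cat([content, style], dim=-1))

# Transfer: source content + target style -> stylized behaviors.
content = ContentEncoder()(torch.randn(2, 100, 128))
style = StyleEncoder()(torch.randn(2, 200, 128))
behaviors = Decoder()(content, style)          # (2, 100, 96)
```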
Design and Evaluation of Shared Prosodic Annotation for Spontaneous French Speech: From Expert Knowledge to Non-Expert Annotation
In the area of large French speech corpora, there is a demonstrated need for a common prosodic notation system allowing for easy data exchange, comparison, and automatic annotation. The major questions are: (1) how to develop a single simple scheme of prosodic transcription which could form the basis of guidelines for non-expert manual annotation (NEMA), used for linguistic teaching and research; (2) based on this NEMA, how to establish reference prosodic corpora (RPC) for different discourse genres (Cresti and Moneglia, 2005); (3) how to use the RPC to develop corpus-based learning methods for automatic prosodic labelling in spontaneous speech (Buhmann et al., 2002; Tamburini and Caini, 2005; Avanzi et al., 2010). This paper presents two pilot experiments conducted with a consortium of 15 French experts in prosody in order to provide a prosodic transcription framework (transcription methodology and transcription reliability measures) and to establish reference prosodic corpora in French.
Stylization and Trajectory Modelling of Short and Long Term Speech Prosody Variations
In this paper, a unified trajectory model based on the stylization and modelling of f0 variations simultaneously over various temporal domains is proposed. The syllable is used as the minimal temporal domain for the description of speech prosody, and short-term and long-term f0 variations are stylized and modelled jointly over various temporal domains. During training, a context-dependent model is estimated from the jointly stylized f0 contours over the syllable and a set of long-term temporal domains. During synthesis, f0 variations are determined using the long-term variations as trajectory constraints. In a subjective evaluation in speech synthesis, the stylization and trajectory modelling of short- and long-term speech prosody variations is shown to consistently model speech prosody and to outperform conventional short-term modelling.
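A sketch of what the per-syllable stylization step can look like, assuming syllable-aligned f0: each syllable's contour is reduced to a few coefficients over a normalized time axis. The low-order polynomial fit and the log-f0 domain are illustrative assumptions, not the paper's exact parameterization:

```python
# Sketch of per-syllable f0 stylization: each syllable's contour is
# reduced to low-order polynomial coefficients on a normalized time axis.
# numpy only; the polynomial order and log-f0 domain are assumptions.
import numpy as np

def stylize_syllable(f0_hz, order=2):
    """Fit log-f0 over one syllable; returns the coefficient vector."""
    f0 = np.log(np.asarray(f0_hz, dtype=float))
    t = np.linspace(0.0, 1.0, len(f0))         # normalized syllable time
    return np.polyfit(t, f0, order)            # (order + 1,) coefficients

def restore_syllable(coeffs, n_frames):
    """Re-synthesize a stylized f0 contour from its coefficients."""
    t = np.linspace(0.0, 1.0, n_frames)
    return np.exp(np.polyval(coeffs, t))

# Example: a rising-falling contour compressed to 3 numbers.
contour = [110, 120, 135, 140, 130, 118]
coeffs = stylize_syllable(contour)
print(restore_syllable(coeffs, len(contour)).round(1))
```

Stacking such coefficients over longer temporal domains (e.g., phrase-level contours) gives the joint short-term and long-term description the abstract refers to.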
A Multi-Level Context-Dependent Prosodic Model applied to duration modeling
This paper proposes a multi-level context-dependent prosodic model based on the estimation of prosodic parameters on a set of well-defined linguistic units. Different linguistic units are used to represent different scales of prosodic variation (local and global forms) and thus to estimate, independently at each level, the linguistic factors that can explain the variations of prosodic parameters. This model is applied to the modeling of syllable-based durational parameters on two read speech corpora: laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach improves performance in terms of the temporal organization of the predicted durations (correlation score) and reduces the model's complexity, while showing comparable performance in terms of relative prediction error. Index Terms: speech synthesis, prosody, multi-level model, context-dependent model.
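A minimal sketch of the multi-level idea under stated assumptions: a global (phrase-level) model captures overall tempo, and a local (syllable-level) model captures residual durations. The scikit-learn regression trees and toy data are stand-ins for the paper's context-dependent models, shown only to make the level decomposition concrete:

```python
# Sketch of a two-level duration model: a phrase-level factor captures
# global tempo while a syllable-level model captures local durations.
# scikit-learn regressors stand in for the paper's context-dependent models.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_syll = rng.normal(size=(500, 6))      # per-syllable linguistic context
X_phrase = rng.normal(size=(500, 3))    # per-phrase context (e.g., length)
y = rng.lognormal(mean=-1.6, sigma=0.3, size=500)   # syllable durations (s)

# Global level: predict a phrase tempo factor, then model the residual
# local variation independently at the syllable level.
phrase_model = DecisionTreeRegressor(max_depth=3).fit(X_phrase, y)
local_target = y / phrase_model.predict(X_phrase)
syll_model = DecisionTreeRegressor(max_depth=5).fit(X_syll, local_target)

pred = phrase_model.predict(X_phrase) * syll_model.predict(X_syll)
print("correlation:", np.corrcoef(y, pred)[0, 1].round(3))
```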
Voice Reenactment with F0 and timing constraints and adversarial learning of conversions
This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressivity of the source speaker is preserved during conversion while the identity of a target speaker is transferred. To do so, an original neural VC architecture is proposed based on sequence-to-sequence voice conversion (S2S-VC) in which the speech prosody of the source speaker is preserved during conversion. First, the S2S-VC architecture is modified so as to synchronize the converted speech with the source speech by means of phonetic duration encoding; second, the decoder is conditioned on the desired sequence of F0 values, and an explicit F0 loss is formulated between the F0 of the source speaker and that of the converted speech. Besides, adversarial learning of conversions is integrated within the S2S-VC architecture so as to exploit the advantages of both the reconstruction of original speech and the conversion of speech with manipulated attributes during training, thereby reducing the inconsistency between training and conversion. An experimental evaluation on the VCTK speech database shows that speech prosody can be efficiently preserved during conversion, and that the proposed adversarial learning consistently improves the conversion and the naturalness of the reenacted speech.
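A sketch of the F0-conditioning and explicit F0 loss described above, assuming PyTorch; the decoder layout and all names are illustrative, not the paper's network. The decoder receives the desired per-frame F0 alongside the content, and a loss ties the F0 of the output back to the source contour:

```python
# Sketch of the F0-constrained decoding idea: the decoder is conditioned
# on the source F0 sequence, and an explicit F0 loss penalizes deviation
# of the converted speech's F0 from the source. PyTorch; names illustrative.
import torch
import torch.nn as nn

class F0ConditionedDecoder(nn.Module):
    def __init__(self, d_content=256, d_out=80):
        super().__init__()
        # +1 input channel carries the desired per-frame F0 value
        self.rnn = nn.GRU(d_content + 1, 256, batch_first=True)
        self.mel_out = nn.Linear(256, d_out)   # converted spectrogram
        self.f0_out = nn.Linear(256, 1)        # F0 re-estimated from output

    def forward(self, content, f0):            # f0: (batch, time, 1)
        h, _ = self.rnn(torch.cat([content, f0], dim=-1))
        return self.mel_out(h), self.f0_out(h)

decoder = F0ConditionedDecoder()
content = torch.randn(2, 120, 256)             # source linguistic content
src_f0 = torch.rand(2, 120, 1)                 # source F0 contour
mel, pred_f0 = decoder(content, src_f0)
# Explicit F0 loss: converted speech must follow the source contour.
f0_loss = nn.functional.l1_loss(pred_f0, src_f0)
```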
Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding
Modeling virtual agents with behavior style is one factor for personalizing human-agent interaction. We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers, including those unseen during training. Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers. We view style as pervasive while speaking: it colors the expressivity of communicative behaviors, while speech content is carried by multimodal signals and text. This disentanglement scheme of content and style allows us to directly infer the style embedding even of speakers whose data are not part of the training phase, without requiring any further training or fine-tuning. The first goal of our model is to generate the gestures of a source speaker based on the content of two modalities, audio and text. The second goal is to condition the predicted gestures of the source speaker on the multimodal behavior style embedding of a target speaker. The third goal is to allow zero-shot style transfer for speakers unseen during training, without retraining the model. Our system consists of: (1) a speaker style encoder network that learns to generate a fixed-dimensional speaker style embedding from a target speaker's multimodal data, and (2) a sequence-to-sequence synthesis network that synthesizes gestures based on the content of the input modalities of a source speaker, conditioned on the speaker style embedding. We show that our model can synthesize the gestures of a source speaker and transfer the knowledge of target-speaker style variability to the gesture generation task in a zero-shot setup. We convert the 2D gestures to 3D poses and produce 3D animations. We conduct objective and subjective evaluations to validate our approach and compare it with a baseline.
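A sketch of the zero-shot step under stated assumptions: the trained style encoder maps an unseen target speaker's multimodal clip to a fixed-size embedding in a single forward pass, and that embedding conditions the gesture generator with no retraining or fine-tuning. The tiny stand-in networks below are illustrative only, not the paper's architecture:

```python
# Sketch of zero-shot style transfer: infer the style embedding of an
# *unseen* speaker with one forward pass; no gradient updates needed.
# PyTorch; the stand-in networks and dimensions are assumptions.
import torch
import torch.nn as nn

style_encoder = nn.GRU(128, 64, batch_first=True)   # -> style embedding
generator = nn.Linear(256 + 64, 2 * 49)             # -> 2D pose per frame

@torch.no_grad()
def zero_shot_transfer(source_content, target_clip):
    _, h = style_encoder(target_clip)
    style = h[-1]                                    # (1, 64), inferred only
    style = style.unsqueeze(1).expand(-1, source_content.size(1), -1)
    return generator(torch.cat([source_content, style], dim=-1))

poses = zero_shot_transfer(torch.randn(1, 100, 256),   # source content
                           torch.randn(1, 300, 128))   # unseen target speaker
print(poses.shape)   # (1, 100, 98): 49 2D joints per frame
```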
Vers une modélisation continue de la structure prosodique: le cas des proéminences syllabiques
The aim of this article is to present a tool developed to semi-automatically model the prosodic structure of French. On the basis of a phoneme alignment, our system detects prominent syllables by taking into consideration basic acoustic criteria such as f0, duration, and the presence of pauses. From the measurements thus obtained, the system assigns a degree of prominence to each syllable identified as salient. We then illustrate the results of the analysis of excerpts from the PROSO_FR corpus. More precisely, we compare the prosodic analysis of sentences that could be made with the traditional rules of prosodic phonology with the analysis carried out by our software. We discuss three rules: the right-dominance rule, the stress-clash rule, and the seven-syllable rule.
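A minimal sketch of scoring prominence from the abstract's three criteria (f0, duration, pauses). The z-score combination, weights, and quantization into discrete degrees are illustrative assumptions, not the tool's actual rules:

```python
# Sketch of acoustic prominence scoring from f0, duration, and pauses,
# quantized into discrete prominence degrees per syllable.
# numpy only; weights and thresholds are illustrative assumptions.
import numpy as np

def prominence_degrees(f0_mean, duration, pause_after, n_levels=4):
    """Score each syllable relative to its neighbours; return 0..n_levels-1."""
    f0 = np.asarray(f0_mean, float)
    dur = np.asarray(duration, float)
    # z-scores relative to the utterance: higher f0 / longer = more salient
    score = ((f0 - f0.mean()) / f0.std()
             + (dur - dur.mean()) / dur.std()
             + np.asarray(pause_after, float))   # bonus for a following pause
    # Quantize the continuous score into discrete prominence degrees.
    edges = np.quantile(score, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(score, edges)

f0 = [110, 150, 115, 120, 180, 112]              # mean f0 per syllable (Hz)
dur = [0.12, 0.22, 0.10, 0.14, 0.30, 0.11]       # syllable durations (s)
pause = [0, 0, 0, 0, 1, 0]                       # pause after syllable?
print(prominence_degrees(f0, dur, pause))
```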
Comparaison de trois outils de détection automatique de proéminences en français parlé
This paper presents the inner details of three different algorithms for prominence detection. On the basis of a 50-minute corpus made of five speaking styles and manually annotated for prominence, a quantitative evaluation compares the three approaches.
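A sketch of how such a comparison can be run: each detector's binary prominence decisions are scored against the manual annotation. The scikit-learn metrics and the toy arrays below are assumptions for illustration, not the paper's actual detectors, corpus, or evaluation protocol:

```python
# Sketch of comparing prominence detectors against a manual reference
# using per-syllable F1 and Cohen's kappa. Toy data, illustrative only.
import numpy as np
from sklearn.metrics import f1_score, cohen_kappa_score

manual = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])   # reference annotation
detectors = {                                        # hypothetical outputs
    "tool_a": np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0]),
    "tool_b": np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0]),
    "tool_c": np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 0]),
}
for name, pred in detectors.items():
    print(name,
          "F1=%.2f" % f1_score(manual, pred),
          "kappa=%.2f" % cohen_kappa_score(manual, pred))
```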